The example below works with CollateX 2.1 (the current release), but not with 2.0.
Near matching, also called fuzzy matching, is based on the idea that two items (such as two word tokens in a collation) should sometimes be considered matching even when they are not string-equal (that is, not identical in every character). More precisely, near matching is a strategy for finding the closest match in situations where there may not be an exact match.
Consider the following example from the Rus′ primary chronicle:
The last column contains slightly differing forms of fraci, but the second witness, Tro, reads fraki. Normalization, including Soundex, takes care of the slight variation during Gothenburg stage 2, but we don’t want to merge c and k globally because the difference between them is usually significant.
When CollateX cannot find an exact match and there is more than one possible alignment for a token, it defaults to placing the token to the left. This means that without near matching, fraki, which does not match either i or fraci exactly, would be misaligned. With near matching, though, CollateX can recognize that fraki is more like fraci than it is like i, and thus place it in the correct (right) column.
Near matching in CollateX is off by default. When the user turns it on, CollateX first completes the alignment stage and then looks for situations that cannot be resolved by exact matching alone, that is, entirely at the alignment stage. Those situations must show both of the following properties:
If and only if both of those conditions are met, CollateX compares the floating token in the shorter witness (“gray” in the example above) to all possible matches (“big” and “grey” in this example) and calculates the nearest match using a measure called edit distance or Levenshtein distance (see https://en.wikipedia.org/wiki/Edit_distance for more information). CollateX calculates the edit distance between the floating token and the tokens in the other witnesses at all of the locations where it could be placed, and determines the best placement. In a tradition with a large number of witnesses and large gaps, the number of comparisons grows quickly, which means that you don’t want to calculate edit distance except where you need to. Performing computationally inexpensive exact string matching first (in the alignment stage) and then calculating the more expensive edit distance (in the analysis stage) only where alignment has failed to give a satisfactory answer reduces the amount of computation required.
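The edit-distance comparison described above can be illustrated with a minimal sketch. This is not CollateX's own code: the function names `edit_distance` and `nearest_match` are hypothetical, the sketch uses plain (unnormalized) Levenshtein distance, and breaking ties in favor of the leftmost candidate is an assumption made here for simplicity.

```python
def edit_distance(a, b):
    """Levenshtein distance: the minimum number of single-character
    insertions, deletions, and substitutions needed to turn a into b."""
    # previous[j] is the distance between the prefix of a processed so far
    # and the first j characters of b; it starts as the distance from "".
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]  # turning i characters of a into "" takes i deletions
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute, or keep when equal
            ))
        previous = current
    return previous[-1]


def nearest_match(floating, candidates):
    """Return the candidate closest to the floating token.
    Assumption of this sketch: ties go to the leftmost candidate,
    because min() returns the first minimal element."""
    return min(candidates, key=lambda candidate: edit_distance(floating, candidate))


floating_token = "gray"
candidates = ["big", "grey"]
for candidate in candidates:
    print(candidate, edit_distance(floating_token, candidate))
print("nearest:", nearest_match(floating_token, candidates))
```

In this sketch "gray" is one substitution away from "grey" but four edits away from "big", so "grey" is chosen. The same calculation separates fraki from its two possible neighbors: it is one edit from fraci and four from i.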
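# Near matching is off by default, so the unmatched token "gray"
# is placed in the leftmost position available to it.
from collatex import Collation, collate

collation = Collation()
collation.add_plain_witness("A", "The gray koala")
collation.add_plain_witness("B", "The big grey koala")
alignment_table = collate(collation, segmentation=False)
print(alignment_table)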
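# Same witnesses, this time with near matching turned on (near_match=True),
# so "gray" can be placed by edit distance instead of defaulting to the left.
from collatex import Collation, collate

collation = Collation()
collation.add_plain_witness("A", "The gray koala")
collation.add_plain_witness("B", "The big grey koala")
alignment_table = collate(collation, segmentation=False, near_match=True)
print(alignment_table)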